Education plays a vital role in our lives, and knowing which factors affect students' test scores would be helpful to educators. We explore the "Student Grade Prediction" data set from Kaggle to understand how parents' backgrounds, test preparation, and other factors influence student performance. We are interested in three questions:
1. What are the fundamental factors that affect students' final grade?
2. Which factors contribute most to poor performance on the final grade?
3. What would be the best way for students to improve their final grade?
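The analysis below assumes the packages and data have been loaded as sketched here; the file name `student-mat.csv` is an assumption about the downloaded Kaggle file and may need adjusting:

```r
# Packages used throughout: tidyverse for wrangling and plots,
# arules/arulesViz for association rule mining
library(tidyverse)
library(arules)
library(arulesViz)

# Read the math-course data into `math`
# (file name is an assumption; adjust the path to your download)
math <- read_csv("student-mat.csv")
```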
# The data set contains the following fields:
# school     Student's school: 'GP' (Gabriel Pereira) or 'MS' (Mousinho da Silveira)
# sex        Student's sex: 'F' (female) or 'M' (male)
# age        Student's age: 15 to 22
# address    Home address type: 'U' (urban) or 'R' (rural)
# famsize    Family size: 'LE3' (3 or fewer) or 'GT3' (greater than 3)
# Pstatus    Parents' cohabitation status: 'T' (living together) or 'A' (apart)
# Medu       Mother's education: 0 (none), 1 (primary), 2 (5th to 9th grade), 3 (secondary), or 4 (higher education)
# Fedu       Father's education: 0 (none), 1 (primary), 2 (5th to 9th grade), 3 (secondary), or 4 (higher education)
# Mjob       Mother's job: 'teacher', 'health' care related, civil 'services' (administrative or police), 'at_home', or 'other'
# Fjob       Father's job: 'teacher', 'health' care related, civil 'services' (administrative or police), 'at_home', or 'other'
# reason     Reason for choosing this school: close to 'home', school 'reputation', 'course' preference, or 'other'
# guardian   Student's guardian: 'mother', 'father', or 'other'
# traveltime Home-to-school travel time: 1 (<15 min), 2 (15 to 30 min), 3 (30 min to 1 hr), 4 (>1 hr)
# studytime  Weekly study time: 1 (<2 hours), 2 (2 to 5 hours), 3 (5 to 10 hours), 4 (>10 hours)
# failures   Number of past class failures: n if 1 <= n < 3, else 4
# schoolsup  Extra educational support: 'yes' or 'no'
# famsup     Family educational support: 'yes' or 'no'
# paid       Extra paid classes within the course subject: 'yes' or 'no'
# activities Extra-curricular activities: 'yes' or 'no'
# nursery    Attended nursery school: 'yes' or 'no'
# higher     Wants to pursue higher education: 'yes' or 'no'
# internet   Internet access at home: 'yes' or 'no'
# romantic   In a romantic relationship: 'yes' or 'no'
# famrel     Quality of family relationships: 1 (very bad) to 5 (excellent)
# freetime   Free time after school: 1 (very low) to 5 (very high)
# goout      Going out with friends: 1 (very low) to 5 (very high)
# Dalc       Workday alcohol consumption: 1 (very low) to 5 (very high)
# Walc       Weekend alcohol consumption: 1 (very low) to 5 (very high)
# health     Current health status: 1 (very bad) to 5 (very good)
# absences   Number of school absences: 0 to 93
# G1         First period grade: 0 to 20
# G2         Second period grade: 0 to 20
# G3         Final grade: 0 to 20 (output target)
# Check for missing values
colSums(is.na(math))
## school sex age address famsize Pstatus
## 0 0 0 0 0 0
## Medu Fedu Mjob Fjob reason guardian
## 0 0 0 0 0 0
## traveltime studytime failures schoolsup famsup paid
## 0 0 0 0 0 0
## activities nursery higher internet romantic famrel
## 0 0 0 0 0 0
## freetime goout Dalc Walc health absences
## 0 0 0 0 0 0
## G1 G2 G3
## 0 0 0
str(math)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 395 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : num 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : num 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: num 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : num 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : num 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : num 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : num 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : num 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : num 6 6 10 15 10 15 11 6 19 15 ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. failures = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. paid = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double()
## .. )
summary(math)
## school sex age address
## Length:395 Length:395 Min. :15.0 Length:395
## Class :character Class :character 1st Qu.:16.0 Class :character
## Mode :character Mode :character Median :17.0 Mode :character
## Mean :16.7
## 3rd Qu.:18.0
## Max. :22.0
## famsize Pstatus Medu Fedu
## Length:395 Length:395 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :3.000 Median :2.000
## Mean :2.749 Mean :2.522
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason
## Length:395 Length:395 Length:395
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## guardian traveltime studytime failures
## Length:395 Min. :1.000 Min. :1.000 Min. :0.0000
## Class :character 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000
## Mode :character Median :1.000 Median :2.000 Median :0.0000
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## schoolsup famsup paid
## Length:395 Length:395 Length:395
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## activities nursery higher
## Length:395 Length:395 Length:395
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## internet romantic famrel freetime
## Length:395 Length:395 Min. :1.000 Min. :1.000
## Class :character Class :character 1st Qu.:4.000 1st Qu.:3.000
## Mode :character Mode :character Median :4.000 Median :3.000
## Mean :3.944 Mean :3.235
## 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000
## goout Dalc Walc health
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000
## Median :3.000 Median :1.000 Median :2.000 Median :4.000
## Mean :3.109 Mean :1.481 Mean :2.291 Mean :3.554
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## absences G1 G2 G3
## Min. : 0.000 Min. : 3.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9.00 1st Qu.: 8.00
## Median : 4.000 Median :11.00 Median :11.00 Median :11.00
## Mean : 5.709 Mean :10.91 Mean :10.71 Mean :10.42
## 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :75.000 Max. :19.00 Max. :19.00 Max. :20.00
math.num <- math %>%
  # school: 0 = GP (Gabriel Pereira), 1 = MS (Mousinho da Silveira)
  mutate(school = ifelse(school == "GP", 0, 1)) %>%
  # sex: 0 = F, 1 = M
  mutate(sex = ifelse(sex == "F", 0, 1)) %>%
  # address: 0 = U (urban), 1 = R (rural)
  mutate(address = ifelse(address == "U", 0, 1)) %>%
  # famsize: 0 = LE3 (3 or fewer), 1 = GT3 (greater than 3)
  mutate(famsize = ifelse(famsize == "LE3", 0, 1)) %>%
  # Pstatus: 0 = T (living together), 1 = A (apart)
  mutate(Pstatus = ifelse(Pstatus == "T", 0, 1)) %>%
  # Mjob: 0 = teacher, 1 = health, 2 = services, 3 = at_home, 4 = other
  mutate(Mjob = case_when(Mjob == "teacher"  ~ 0,
                          Mjob == "health"   ~ 1,
                          Mjob == "services" ~ 2,
                          Mjob == "at_home"  ~ 3,
                          TRUE               ~ 4)) %>%
  # Fjob: same coding as Mjob
  mutate(Fjob = case_when(Fjob == "teacher"  ~ 0,
                          Fjob == "health"   ~ 1,
                          Fjob == "services" ~ 2,
                          Fjob == "at_home"  ~ 3,
                          TRUE               ~ 4)) %>%
  # reason: 0 = home, 1 = reputation, 2 = course, 3 = other
  mutate(reason = case_when(reason == "home"       ~ 0,
                            reason == "reputation" ~ 1,
                            reason == "course"     ~ 2,
                            TRUE                   ~ 3)) %>%
  # guardian: 0 = mother, 1 = father, 2 = other
  mutate(guardian = case_when(guardian == "mother" ~ 0,
                              guardian == "father" ~ 1,
                              TRUE                 ~ 2)) %>%
  # yes/no variables: 0 = no, 1 = yes
  mutate(schoolsup  = ifelse(schoolsup == "no", 0, 1)) %>%
  mutate(famsup     = ifelse(famsup == "no", 0, 1)) %>%
  mutate(paid       = ifelse(paid == "no", 0, 1)) %>%
  mutate(activities = ifelse(activities == "no", 0, 1)) %>%
  mutate(nursery    = ifelse(nursery == "no", 0, 1)) %>%
  mutate(higher     = ifelse(higher == "no", 0, 1)) %>%
  mutate(internet   = ifelse(internet == "no", 0, 1)) %>%
  mutate(romantic   = ifelse(romantic == "no", 0, 1))
# Convert any remaining character columns to numeric. This works because the
# recoding above assigned numeric codes directly rather than substituting
# strings (i.e., no gsub() was used).
math.num <- math.num %>%
  mutate(across(where(is.character), as.numeric))
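`math.num` is not used again in this excerpt; presumably it feeds a correlation or modeling step. As a usage sketch (an assumption, not code from the original analysis), the all-numeric frame makes it easy to see how each predictor correlates with the final grade:

```r
# Correlation of every numeric predictor with the final grade G3,
# sorted by strength (sketch; not part of the original analysis)
g3.cor <- cor(math.num)[, "G3"]
sort(round(g3.cor, 2), decreasing = TRUE)
```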
math.assoc <- math %>%
  # Bin each grade into letter grades: F (<10), D (10-11), C (12-13), B (14-15), A (16-20)
  mutate(G1_bin = case_when(G1 < 10 ~ "F",
                            G1 < 12 ~ "D",
                            G1 < 14 ~ "C",
                            G1 < 16 ~ "B",
                            TRUE    ~ "A")) %>%
  mutate(G2_bin = case_when(G2 < 10 ~ "F",
                            G2 < 12 ~ "D",
                            G2 < 14 ~ "C",
                            G2 < 16 ~ "B",
                            TRUE    ~ "A")) %>%
  mutate(G3_bin = case_when(G3 < 10 ~ "F",
                            G3 < 12 ~ "D",
                            G3 < 14 ~ "C",
                            G3 < 16 ~ "B",
                            TRUE    ~ "A"))
# Association rule mining treats every variable as categorical
math.assoc <- math.assoc %>%
  mutate(across(everything(), as.factor))
# Check the data types
glimpse(math.assoc)
## Observations: 395
## Variables: 36
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G…
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,…
## $ age <fct> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 1…
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3…
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T,…
## $ Medu <fct> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,…
## $ Fedu <fct> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3,…
## $ Mjob <fct> at_home, at_home, at_home, health, other, services, o…
## $ Fjob <fct> teacher, other, other, services, other, other, other,…
## $ reason <fct> course, course, other, home, home, reputation, home, …
## $ guardian <fct> mother, father, mother, mother, father, mother, mothe…
## $ traveltime <fct> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3,…
## $ studytime <fct> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2,…
## $ failures <fct> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no…
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, y…
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, y…
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, y…
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, …
## $ famrel <fct> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5,…
## $ freetime <fct> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3,…
## $ goout <fct> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2,…
## $ Dalc <fct> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc <fct> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1,…
## $ health <fct> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4,…
## $ absences <fct> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, …
## $ G1 <fct> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14…
## $ G2 <fct> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16…
## $ G3 <fct> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 1…
## $ G1_bin <fct> F, F, F, B, F, B, C, F, A, B, D, D, B, D, B, B, C, F,…
## $ G2_bin <fct> F, F, F, B, D, B, C, F, A, B, F, C, B, D, A, B, B, D,…
## $ G3_bin <fct> F, F, D, B, D, B, D, F, A, B, F, C, B, D, A, B, B, D,…
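As a quick sanity check on the letter-grade binning (a sketch; the original does not show these counts), the distribution of final-grade bins can be tabulated:

```r
# Count students in each final-grade bin
table(math.assoc$G3_bin)
```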
# Relationship between G1 and G2, colored by final grade G3
math %>%
ggplot(aes(x=G1, y=G2)) +
geom_jitter(aes(col=G3)) +
scale_color_continuous(low = "red", high = "blue") +
xlab("Score on G1") +
ylab("Score on G2") +
ggtitle("Plot 0. Scores throughout the year")
# How many students passed (10-20) and how many failed (0-9)?
math %>%
mutate(pass = ifelse(G3 > 9, "pass", "fail")) %>%
ggplot(aes(x = pass, fill = pass)) +
geom_bar() +
xlab("Result at End of Term") +
ylab("Number of Students") +
annotate(geom="text", label = "265", x = 2, y = 250) +
annotate(geom="text", label = "130", x = 1, y = 115) +
annotate(geom="text", label = "67.1%", x = 2, y = 235) +
annotate(geom="text", label = "32.9%", x = 1, y = 100) +
ggtitle("Plot 1. Students Pass Rate") +
theme_light()
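The annotated counts in Plot 1 are hard-coded. As a sketch (assuming the same pass threshold of G3 > 9), they can instead be derived from the data, so the labels stay correct if the data changes:

```r
# Compute pass/fail counts and percentages rather than hard-coding them
pass.tbl <- math %>%
  mutate(pass = ifelse(G3 > 9, "pass", "fail")) %>%
  count(pass) %>%
  mutate(pct = round(100 * n / sum(n), 1))
pass.tbl
```

For this data set the computed values match the annotations above (265 pass / 130 fail, i.e. 67.1% / 32.9%).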
# School comparison: address and reason for choosing, by school
math.assoc %>%
ggplot(aes(x=address)) +
geom_bar(aes(fill=reason)) +
geom_text(stat = "count", aes(label=stat(count)), vjust=-1) +
ggtitle("Plot 2. School comparison") +
facet_wrap(~school) +
ylim(0,330) +
theme_light() +
geom_text(data=math.assoc %>% group_by(school) %>% tally(),
aes(x=1.5, y=330, label = paste0( round(n/sum(n), digits = 2)*100, "%")),
vjust = 1.5, size = 5)
# Parents' jobs and education vs final grade
#- Mother
math.assoc %>%
ggplot(aes(x=Mjob, y=G3)) +
geom_jitter(aes(col=Medu)) +
ggtitle("Plot 3. Mother's education and mother's job", subtitle = "Male vs Female") +
facet_grid(~sex) +
ylab("Final Grade") +
xlab("Job") +
  geom_hline(yintercept = 10) +
  annotate(geom="text", label = "Pass", x = 0.8, y = 10.5) +
  annotate(geom="text", label = "Fail", x = 0.8, y = 9) +
theme_light()
#- Father
math.assoc %>%
ggplot(aes(x=Fjob, y=G3)) +
geom_jitter(aes(col=Fedu)) +
ggtitle("Plot 4. Father's education and father's job", subtitle = "Male vs Female") +
facet_wrap(~sex) +
ylab("Final Grade") +
xlab("Job") +
  geom_hline(yintercept = 10) +
  annotate(geom="text", label = "Pass", x = 0.8, y = 10.5) +
  annotate(geom="text", label = "Fail", x = 0.8, y = 9) +
theme_light()
# Failures, school support vs grade
math.assoc %>%
ggplot(aes(x=failures, y=G3)) +
geom_jitter(aes(col=sex)) +
ggtitle("Plot 5. Failures vs Final Grade", subtitle = "School educational support") +
facet_wrap(~schoolsup) +
ylab("Final grade") +
xlab("Number of failures") +
theme_light()
# Family size, Family support, parent status, family relationship vs grade
math.assoc %>%
ggplot(aes(x=famsize, y=G3)) +
geom_jitter(aes(col=famrel, shape=Pstatus)) +
ggtitle("Plot 6. Family Size vs Final Grade", subtitle = "Family educational support") +
facet_wrap(~famsup) +
ylab("Final grade") +
xlab("Family size") +
theme_light()
# Study time, going out, free time vs grade
math.assoc %>%
ggplot(aes(x=studytime, y=G3)) +
geom_jitter(aes(col=goout)) +
ggtitle("Plot 7. Study time vs Final Grade", subtitle = "Free time") +
facet_grid(~freetime) +
ylab("Final grade") +
xlab("Study time") +
scale_color_discrete(name="Go out") +
theme_light()
# Absences and activities vs grade
math.assoc %>%
ggplot(aes(x=absences, y=G3)) +
geom_jitter(aes(col=activities)) +
ggtitle("Plot 8. Absences vs Final Grade") +
ylab("Final grade") +
scale_color_discrete(name="Extra-curricular activities") +
theme_light()
# Remove grade variables from data set
association <- math.assoc %>%
select(-c(G1,G2,G3))
# Run apriori with support = 0.2, confidence = 0.95, maxlen = 3
rules <- apriori(data = association, parameter = list(sup = 0.2, conf = 0.95, maxlen = 3),
control = list(verbose=F))
# Look at a summary of the rules
summary(rules)
# Sort by confidence and lift to see most relevant rules
rules <- sort(rules, by = c("confidence", "lift", "support"), decreasing = TRUE)
# Inspect the first 20 redundant rules
inspect(rules[is.redundant(rules)][1:20])
## lhs rhs support confidence
## [1] {age=16,higher=yes} => {school=GP} 0.2556962 1.0000000
## [2] {age=16,Pstatus=T} => {school=GP} 0.2405063 1.0000000
## [3] {age=16,address=U} => {school=GP} 0.2354430 1.0000000
## [4] {age=16,failures=0} => {school=GP} 0.2227848 1.0000000
## [5] {age=16,internet=yes} => {school=GP} 0.2227848 1.0000000
## [6] {age=16,schoolsup=no} => {school=GP} 0.2177215 1.0000000
## [7] {age=16,nursery=yes} => {school=GP} 0.2126582 1.0000000
## [8] {age=15,higher=yes} => {school=GP} 0.2075949 1.0000000
## [9] {age=16,guardian=mother} => {school=GP} 0.2000000 1.0000000
## [10] {school=GP,age=15} => {higher=yes} 0.2075949 1.0000000
## [11] {Pstatus=T,paid=yes} => {higher=yes} 0.4151899 0.9939394
## [12] {paid=yes,internet=yes} => {higher=yes} 0.4075949 0.9938272
## [13] {failures=0,paid=yes} => {higher=yes} 0.4000000 0.9937107
## [14] {schoolsup=no,paid=yes} => {higher=yes} 0.4000000 0.9937107
## [15] {paid=yes,nursery=yes} => {higher=yes} 0.3822785 0.9934211
## [16] {school=GP,Walc=1} => {Dalc=1} 0.3544304 0.9929078
## [17] {famsup=yes,paid=yes} => {higher=yes} 0.3493671 0.9928058
## [18] {Pstatus=T,Walc=1} => {Dalc=1} 0.3291139 0.9923664
## [19] {failures=0,Walc=1} => {Dalc=1} 0.3189873 0.9921260
## [20] {address=U,Walc=1} => {Dalc=1} 0.3164557 0.9920635
## lift count
## [1] 1.131805 101
## [2] 1.131805 95
## [3] 1.131805 93
## [4] 1.131805 88
## [5] 1.131805 88
## [6] 1.131805 86
## [7] 1.131805 84
## [8] 1.131805 82
## [9] 1.131805 79
## [10] 1.053333 82
## [11] 1.046949 164
## [12] 1.046831 161
## [13] 1.046709 158
## [14] 1.046709 158
## [15] 1.046404 151
## [16] 1.421009 140
## [17] 1.045755 138
## [18] 1.420235 130
## [19] 1.419890 126
## [20] 1.419801 125
# View the first 20 non-redundant rules
inspect(rules[!is.redundant(rules)][1:20])
## lhs rhs support confidence
## [1] {higher=yes,Walc=1} => {Dalc=1} 0.3645570 1
## [2] {romantic=no,Walc=1} => {Dalc=1} 0.2531646 1
## [3] {sex=F,Walc=1} => {Dalc=1} 0.2379747 1
## [4] {activities=yes,Walc=1} => {Dalc=1} 0.2177215 1
## [5] {Fjob=other,Walc=1} => {Dalc=1} 0.2000000 1
## [6] {age=16} => {school=GP} 0.2632911 1
## [7] {age=15} => {school=GP} 0.2075949 1
## [8] {school=GP,paid=yes} => {higher=yes} 0.4075949 1
## [9] {address=U,paid=yes} => {higher=yes} 0.3670886 1
## [10] {guardian=mother,paid=yes} => {higher=yes} 0.3316456 1
## [11] {traveltime=1,activities=yes} => {higher=yes} 0.3316456 1
## [12] {famsize=GT3,paid=yes} => {higher=yes} 0.3291139 1
## [13] {paid=yes,Dalc=1} => {higher=yes} 0.3088608 1
## [14] {paid=yes,romantic=no} => {higher=yes} 0.3037975 1
## [15] {traveltime=1,paid=yes} => {higher=yes} 0.3012658 1
## [16] {sex=F,paid=yes} => {higher=yes} 0.2734177 1
## [17] {Fjob=other,paid=yes} => {higher=yes} 0.2556962 1
## [18] {studytime=2,paid=yes} => {higher=yes} 0.2506329 1
## [19] {sex=F,activities=yes} => {higher=yes} 0.2430380 1
## [20] {Medu=4,Dalc=1} => {higher=yes} 0.2303797 1
## lift count
## [1] 1.431159 144
## [2] 1.431159 100
## [3] 1.431159 94
## [4] 1.431159 86
## [5] 1.431159 79
## [6] 1.131805 104
## [7] 1.131805 82
## [8] 1.053333 161
## [9] 1.053333 145
## [10] 1.053333 131
## [11] 1.053333 131
## [12] 1.053333 130
## [13] 1.053333 122
## [14] 1.053333 120
## [15] 1.053333 119
## [16] 1.053333 108
## [17] 1.053333 101
## [18] 1.053333 99
## [19] 1.053333 96
## [20] 1.053333 91
# Plot the rules as a scatterplot (support vs lift, shaded by confidence)
plot(rules, measure = c("support", "lift"), shading = "confidence")
# Visualize the grouped matrix for the first 20 rules
plot(rules[1:20], method = "grouped")
# Graph visualization of the first 20 rules
plot(rules[1:20], method="graph", interactive=FALSE, shading=NA)
# Parallel coordinates plot on the first 20 rules
plot(rules[1:20], method="paracoord", reorder=TRUE)
rules.G3.A <-
apriori(data = association[,-(31:32)], parameter = list(sup = 0.01, conf = 0.7, maxlen = 4),
appearance = list(default="lhs",rhs=c("G3_bin=A")),
control = list(verbose=F))
summary(rules.G3.A)
## set of 30 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4
## 1 29
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 3.967 4.000 4.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.01013 Min. :0.7143 Min. :7.054 Min. :4.000
## 1st Qu.:0.01013 1st Qu.:0.7143 1st Qu.:7.054 1st Qu.:4.000
## Median :0.01266 Median :0.8000 Median :7.900 Median :5.000
## Mean :0.01148 Mean :0.7824 Mean :7.726 Mean :4.533
## 3rd Qu.:0.01266 3rd Qu.:0.8000 3rd Qu.:7.900 3rd Qu.:5.000
## Max. :0.01266 Max. :1.0000 Max. :9.875 Max. :5.000
##
## mining info:
## data ntransactions support confidence
## association[, -(31:32)] 395 0.01 0.7
rules.G3.A <- sort(rules.G3.A, by = c("confidence", "lift", "support"), decreasing = TRUE)
inspect(rules.G3.A[!is.redundant(rules.G3.A)])
## lhs rhs support confidence lift count
## [1] {Mjob=teacher,
## reason=course,
## freetime=5} => {G3_bin=A} 0.01265823 1.0000000 9.875000 5
## [2] {Mjob=teacher,
## famsup=no,
## freetime=5} => {G3_bin=A} 0.01012658 1.0000000 9.875000 4
## [3] {sex=M,
## Mjob=teacher,
## freetime=5} => {G3_bin=A} 0.01265823 0.8333333 8.229167 5
## [4] {Mjob=teacher,
## schoolsup=no,
## freetime=5} => {G3_bin=A} 0.01265823 0.8333333 8.229167 5
## [5] {Medu=4,
## reason=course,
## freetime=5} => {G3_bin=A} 0.01265823 0.8333333 8.229167 5
## [6] {Medu=4,
## Mjob=services,
## Fjob=teacher} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [7] {Mjob=services,
## Fjob=teacher,
## failures=0} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [8] {Fjob=teacher,
## reason=course,
## activities=yes} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [9] {Fjob=teacher,
## traveltime=1,
## famsup=no} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [10] {Mjob=teacher,
## activities=yes,
## freetime=5} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [11] {Mjob=teacher,
## paid=no,
## freetime=5} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [12] {Mjob=teacher,
## romantic=no,
## freetime=5} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [13] {age=15,
## Medu=4,
## freetime=5} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [14] {sex=M,
## Pstatus=A,
## absences=0} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [15] {famsize=LE3,
## studytime=3,
## famrel=5} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [16] {famsize=LE3,
## studytime=3,
## Walc=1} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [17] {famsize=LE3,
## studytime=3,
## Dalc=1} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [18] {Medu=4,
## famrel=5,
## health=4} => {G3_bin=A} 0.01012658 0.8000000 7.900000 4
## [19] {Mjob=teacher,
## freetime=5} => {G3_bin=A} 0.01265823 0.7142857 7.053571 5
## [20] {Mjob=services,
## Fjob=teacher,
## paid=no} => {G3_bin=A} 0.01265823 0.7142857 7.053571 5
## [21] {address=U,
## Mjob=services,
## Fjob=teacher} => {G3_bin=A} 0.01265823 0.7142857 7.053571 5
## [22] {Mjob=services,
## Fjob=teacher,
## nursery=yes} => {G3_bin=A} 0.01265823 0.7142857 7.053571 5
# Visualize the grouped matrix for the first 20 rules
plot(rules.G3.A[1:20], method = "grouped")
rules.G3.B <-
apriori(data = association[,-(31:32)], parameter = list(sup = 0.01, conf = 0.8, maxlen = 4),
appearance = list(default="lhs",rhs=c("G3_bin=B")),
control = list(verbose=F))
summary(rules.G3.B)
## set of 26 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4
## 1 25
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 3.962 4.000 4.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.01013 Min. :0.8000 Min. :5.267 Min. :4.000
## 1st Qu.:0.01013 1st Qu.:0.8000 1st Qu.:5.267 1st Qu.:4.000
## Median :0.01013 Median :0.8000 Median :5.267 Median :4.000
## Mean :0.01052 Mean :0.8205 Mean :5.402 Mean :4.154
## 3rd Qu.:0.01013 3rd Qu.:0.8000 3rd Qu.:5.267 3rd Qu.:4.000
## Max. :0.01266 Max. :1.0000 Max. :6.583 Max. :5.000
##
## mining info:
## data ntransactions support confidence
## association[, -(31:32)] 395 0.01 0.8
rules.G3.B <- sort(rules.G3.B, by = c("confidence", "lift", "support"), decreasing = TRUE)
inspect(rules.G3.B[!is.redundant(rules.G3.B)])
## lhs rhs support confidence lift count
## [1] {famsize=GT3,
## Mjob=health,
## studytime=3} => {G3_bin=B} 0.01012658 1.0000000 6.583333 4
## [2] {Medu=4,
## Fedu=2,
## studytime=3} => {G3_bin=B} 0.01012658 1.0000000 6.583333 4
## [3] {activities=yes,
## goout=4,
## absences=6} => {G3_bin=B} 0.01265823 0.8333333 5.486111 5
## [4] {studytime=3,
## paid=no,
## Walc=3} => {G3_bin=B} 0.01265823 0.8333333 5.486111 5
## [5] {Medu=3,
## famsup=no,
## Walc=3} => {G3_bin=B} 0.01265823 0.8333333 5.486111 5
## [6] {Fedu=4,
## nursery=no,
## Dalc=1} => {G3_bin=B} 0.01265823 0.8333333 5.486111 5
## [7] {Mjob=health,
## studytime=3} => {G3_bin=B} 0.01012658 0.8000000 5.266667 4
## [8] {Medu=2,
## goout=4,
## absences=6} => {G3_bin=B} 0.01012658 0.8000000 5.266667 4
## [9] {sex=M,
## goout=4,
## absences=6} => {G3_bin=B} 0.01012658 0.8000000 5.266667 4
## [10] {Mjob=services,
## traveltime=1,
## absences=6} => {G3_bin=B} 0.01012658 0.8000000 5.266667 4
## [11] {sex=M,
## Medu=2,
## absences=6} => {G3_bin=B} 0.01012658 0.8000000 5.266667 4
## [12] {Medu=4,
## Mjob=health,
## health=3} => {G3_bin=B} 0.01012658 0.8000000 5.266667 4
## [13] {Medu=4,
## activities=no,
## absences=4} => {G3_bin=B} 0.01012658 0.8000000 5.266667 4
## [14] {internet=no,
## Walc=2,
## absences=2} => {G3_bin=B} 0.01012658 0.8000000 5.266667 4
## [15] {age=17,
## studytime=3,
## Walc=3} => {G3_bin=B} 0.01012658 0.8000000 5.266667 4
## [16] {sex=M,
## internet=no,
## Walc=2} => {G3_bin=B} 0.01012658 0.8000000 5.266667 4
rules.G3.C <-
apriori(data = association[,-(31:32)], parameter = list(sup = 0.01, conf = 0.8, maxlen = 4),
appearance = list(default="lhs",rhs=c("G3_bin=C")),
control = list(verbose=F))
summary(rules.G3.C)
## set of 36 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4
## 1 35
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 3.972 4.000 4.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.01013 Min. :0.8000 Min. :5.097 Min. :4.000
## 1st Qu.:0.01013 1st Qu.:0.8000 1st Qu.:5.097 1st Qu.:4.000
## Median :0.01013 Median :0.8000 Median :5.097 Median :4.000
## Mean :0.01027 Mean :0.8574 Mean :5.463 Mean :4.056
## 3rd Qu.:0.01013 3rd Qu.:1.0000 3rd Qu.:6.371 3rd Qu.:4.000
## Max. :0.01266 Max. :1.0000 Max. :6.371 Max. :5.000
##
## mining info:
## data ntransactions support confidence
## association[, -(31:32)] 395 0.01 0.8
rules.G3.C <- sort(rules.G3.C, by = c("confidence", "lift", "support"), decreasing = TRUE)
inspect(rules.G3.C[!is.redundant(rules.G3.C)])
## lhs rhs support confidence lift count
## [1] {Mjob=other,
## Dalc=5} => {G3_bin=C} 0.01012658 1.0000000 6.370968 4
## [2] {famsize=GT3,
## failures=0,
## famrel=1} => {G3_bin=C} 0.01012658 1.0000000 6.370968 4
## [3] {school=GP,
## Fjob=at_home,
## freetime=4} => {G3_bin=C} 0.01012658 1.0000000 6.370968 4
## [4] {Mjob=other,
## famsup=yes,
## Walc=5} => {G3_bin=C} 0.01012658 1.0000000 6.370968 4
## [5] {Medu=3,
## failures=1,
## freetime=5} => {G3_bin=C} 0.01012658 1.0000000 6.370968 4
## [6] {Mjob=other,
## higher=yes,
## Walc=5} => {G3_bin=C} 0.01265823 0.8333333 5.309140 5
## [7] {address=R,
## famsize=LE3,
## Medu=1} => {G3_bin=C} 0.01265823 0.8333333 5.309140 5
## [8] {famsize=GT3,
## internet=yes,
## famrel=1} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [9] {school=GP,
## famsize=GT3,
## famrel=1} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [10] {failures=0,
## internet=yes,
## famrel=1} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [11] {school=GP,
## failures=0,
## famrel=1} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [12] {Fjob=at_home,
## traveltime=1,
## freetime=4} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [13] {Mjob=other,
## Fjob=other,
## Walc=5} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [14] {Mjob=other,
## failures=0,
## Walc=5} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [15] {Mjob=health,
## failures=0,
## Walc=4} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [16] {Mjob=health,
## internet=yes,
## Walc=4} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [17] {Pstatus=T,
## Mjob=health,
## Walc=4} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [18] {famsize=LE3,
## reason=other,
## famrel=4} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [19] {Medu=3,
## Fjob=services,
## freetime=5} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [20] {Medu=3,
## activities=no,
## freetime=5} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [21] {reason=course,
## traveltime=2,
## freetime=5} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [22] {address=R,
## famsize=LE3,
## Mjob=at_home} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [23] {Medu=2,
## guardian=father,
## freetime=2} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [24] {Mjob=other,
## traveltime=2,
## freetime=2} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [25] {Dalc=2,
## Walc=3,
## absences=2} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [26] {address=R,
## reason=reputation,
## internet=no} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [27] {studytime=1,
## famrel=3,
## freetime=4} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [28] {Medu=2,
## reason=course,
## Walc=3} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [29] {sex=M,
## Medu=3,
## Walc=2} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [30] {address=R,
## Fjob=services,
## guardian=father} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
## [31] {Mjob=services,
## goout=3,
## health=3} => {G3_bin=C} 0.01012658 0.8000000 5.096774 4
rules.G3.D <-
apriori(data = association[,-(31:32)], parameter = list(sup = 0.01, conf = 0.95, maxlen = 4),
appearance = list(default="lhs",rhs=c("G3_bin=D")),
control = list(verbose=F))
summary(rules.G3.D)
## set of 40 rules
##
## rule length distribution (lhs + rhs):sizes
## 4
## 40
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4 4 4 4 4 4
##
## summary of quality measures:
## support confidence lift count
## Min. :0.01013 Min. :1 Min. :3.835 Min. :4.000
## 1st Qu.:0.01013 1st Qu.:1 1st Qu.:3.835 1st Qu.:4.000
## Median :0.01013 Median :1 Median :3.835 Median :4.000
## Mean :0.01057 Mean :1 Mean :3.835 Mean :4.175
## 3rd Qu.:0.01013 3rd Qu.:1 3rd Qu.:3.835 3rd Qu.:4.000
## Max. :0.01519 Max. :1 Max. :3.835 Max. :6.000
##
## mining info:
## data ntransactions support confidence
## association[, -(31:32)] 395 0.01 0.95
rules.G3.D <- sort(rules.G3.D, by = c("confidence", "lift", "support"), decreasing = TRUE)
inspect(rules.G3.D[!is.redundant(rules.G3.D)])
## lhs rhs support confidence lift count
## [1] {Medu=3,
## schoolsup=yes,
## activities=yes} => {G3_bin=D} 0.01518987 1 3.834951 6
## [2] {Medu=3,
## Fedu=3,
## schoolsup=yes} => {G3_bin=D} 0.01265823 1 3.834951 5
## [3] {address=R,
## guardian=mother,
## absences=4} => {G3_bin=D} 0.01265823 1 3.834951 5
## [4] {Fedu=1,
## reason=home,
## absences=2} => {G3_bin=D} 0.01265823 1 3.834951 5
## [5] {Fedu=1,
## paid=yes,
## absences=2} => {G3_bin=D} 0.01265823 1 3.834951 5
## [6] {Fedu=2,
## reason=reputation,
## romantic=yes} => {G3_bin=D} 0.01265823 1 3.834951 5
## [7] {traveltime=4,
## paid=no,
## internet=yes} => {G3_bin=D} 0.01012658 1 3.834951 4
## [8] {age=17,
## romantic=no,
## goout=1} => {G3_bin=D} 0.01012658 1 3.834951 4
## [9] {Mjob=services,
## studytime=1,
## goout=1} => {G3_bin=D} 0.01012658 1 3.834951 4
## [10] {sex=M,
## studytime=1,
## goout=1} => {G3_bin=D} 0.01012658 1 3.834951 4
## [11] {famsize=LE3,
## activities=no,
## goout=1} => {G3_bin=D} 0.01012658 1 3.834951 4
## [12] {studytime=4,
## famsup=no,
## goout=3} => {G3_bin=D} 0.01012658 1 3.834951 4
## [13] {famrel=5,
## goout=5,
## absences=6} => {G3_bin=D} 0.01012658 1 3.834951 4
## [14] {school=MS,
## famsize=GT3,
## Mjob=at_home} => {G3_bin=D} 0.01012658 1 3.834951 4
## [15] {school=MS,
## Dalc=1,
## absences=2} => {G3_bin=D} 0.01012658 1 3.834951 4
## [16] {reason=reputation,
## failures=1,
## goout=2} => {G3_bin=D} 0.01012658 1 3.834951 4
## [17] {schoolsup=yes,
## famrel=5,
## absences=4} => {G3_bin=D} 0.01012658 1 3.834951 4
## [18] {Medu=3,
## schoolsup=yes,
## goout=2} => {G3_bin=D} 0.01012658 1 3.834951 4
## [19] {Medu=3,
## reason=reputation,
## schoolsup=yes} => {G3_bin=D} 0.01012658 1 3.834951 4
## [20] {Medu=1,
## guardian=mother,
## absences=4} => {G3_bin=D} 0.01012658 1 3.834951 4
## [21] {age=17,
## Walc=3,
## absences=4} => {G3_bin=D} 0.01012658 1 3.834951 4
## [22] {Mjob=other,
## Walc=2,
## absences=4} => {G3_bin=D} 0.01012658 1 3.834951 4
## [23] {freetime=3,
## goout=2,
## absences=4} => {G3_bin=D} 0.01012658 1 3.834951 4
## [24] {Mjob=at_home,
## freetime=3,
## absences=2} => {G3_bin=D} 0.01012658 1 3.834951 4
## [25] {Mjob=at_home,
## reason=reputation,
## Walc=3} => {G3_bin=D} 0.01012658 1 3.834951 4
## [26] {Mjob=at_home,
## Dalc=1,
## Walc=3} => {G3_bin=D} 0.01012658 1 3.834951 4
## [27] {famsize=GT3,
## Mjob=at_home,
## Walc=3} => {G3_bin=D} 0.01012658 1 3.834951 4
## [28] {Mjob=at_home,
## reason=reputation,
## goout=3} => {G3_bin=D} 0.01012658 1 3.834951 4
## [29] {Medu=1,
## studytime=2,
## health=4} => {G3_bin=D} 0.01012658 1 3.834951 4
## [30] {age=18,
## Medu=1,
## reason=home} => {G3_bin=D} 0.01012658 1 3.834951 4
## [31] {age=15,
## Medu=1,
## paid=yes} => {G3_bin=D} 0.01012658 1 3.834951 4
## [32] {Medu=1,
## reason=course,
## paid=yes} => {G3_bin=D} 0.01012658 1 3.834951 4
## [33] {sex=F,
## age=18,
## absences=2} => {G3_bin=D} 0.01012658 1 3.834951 4
## [34] {age=17,
## Walc=1,
## absences=2} => {G3_bin=D} 0.01012658 1 3.834951 4
## [35] {reason=home,
## traveltime=2,
## absences=2} => {G3_bin=D} 0.01012658 1 3.834951 4
## [36] {age=16,
## paid=yes,
## internet=no} => {G3_bin=D} 0.01012658 1 3.834951 4
## [37] {age=15,
## activities=yes,
## Dalc=2} => {G3_bin=D} 0.01012658 1 3.834951 4
## [38] {age=16,
## reason=reputation,
## Walc=3} => {G3_bin=D} 0.01012658 1 3.834951 4
## [39] {address=R,
## Fedu=3,
## health=3} => {G3_bin=D} 0.01012658 1 3.834951 4
## [40] {address=R,
## Mjob=services,
## Fjob=services} => {G3_bin=D} 0.01012658 1 3.834951 4
rules.G3.F <-
apriori(data = association[,-(31:32)], parameter = list(sup = 0.03, conf = 0.8, maxlen = 3),
appearance = list(default="lhs",rhs=c("G3_bin=F")),
control = list(verbose=F))
summary(rules.G3.F)
## set of 8 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 1 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 2.875 3.000 3.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.03038 Min. :0.8000 Min. :2.431 Min. :12.00
## 1st Qu.:0.03038 1st Qu.:0.8000 1st Qu.:2.431 1st Qu.:12.00
## Median :0.03038 Median :0.8180 Median :2.486 Median :12.00
## Mean :0.03133 Mean :0.8259 Mean :2.510 Mean :12.38
## 3rd Qu.:0.03101 3rd Qu.:0.8571 3rd Qu.:2.604 3rd Qu.:12.25
## Max. :0.03544 Max. :0.8571 Max. :2.604 Max. :14.00
##
## mining info:
## data ntransactions support confidence
## association[, -(31:32)] 395 0.03 0.8
rules.G3.F <- sort(rules.G3.F, by = c("confidence", "lift", "support"), decreasing = TRUE)
inspect(rules.G3.F[!is.redundant(rules.G3.F)])
## lhs rhs support confidence lift
## [1] {failures=3,schoolsup=no} => {G3_bin=F} 0.03037975 0.8571429 2.604396
## [2] {school=GP,failures=2} => {G3_bin=F} 0.03037975 0.8571429 2.604396
## [3] {goout=5,absences=0} => {G3_bin=F} 0.03037975 0.8571429 2.604396
## [4] {failures=2} => {G3_bin=F} 0.03544304 0.8235294 2.502262
## count
## [1] 12
## [2] 12
## [3] 12
## [4] 14
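As a sanity check, support, confidence, and lift can be recomputed by hand from the reported counts. The sketch below does this for rule [4], {failures=2} => {G3_bin=F}; the left-hand-side count of 17 and the overall G3_bin=F count of 130 are not printed above, but are back-derived from the printed confidence and lift.

```r
# Recompute the quality measures for {failures=2} => {G3_bin=F}
n      <- 395  # transactions
n.both <- 14   # failures=2 AND G3_bin=F (printed count)
n.lhs  <- 17   # failures=2 overall (implied by confidence = 14/17)
n.rhs  <- 130  # G3_bin=F overall (implied by lift = confidence / (130/395))

support    <- n.both / n                 # 0.03544304
confidence <- n.both / n.lhs             # 0.8235294
lift       <- confidence / (n.rhs / n)   # 2.502262
```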
math.stand<- math.num %>% scale()
K-means clustering minimizes the intra-cluster variation. The Gap Statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value of k that maximizes the gap statistic.
gs.graph <- fviz_nbclust(math.stand, kmeans, method = "gap_stat") +
labs(subtitle = "Gap Statistic Method")
gs.graph
The Elbow Method defines clusters such that the total intra-cluster variation, or total within-cluster sum of squares (WSS), is minimized. The optimal number of clusters is the value of k at which adding another cluster no longer improves the WSS by a significant amount.
elbow <- fviz_nbclust(math.stand, kmeans, method = "wss") +
labs(subtitle = "Elbow Method")
elbow
set.seed(23)
# Function to compute total within-cluster sum of squares for a given k
wss <- function(k){
kmeans(math.num, centers = k, nstart = 10)$tot.withinss
}
# Compute WSS for k = 1 to k = 10
k.values <- 1:10
# Extract WSS for 1 to 10 clusters
wss.values <- map_dbl(k.values, wss)
# Create a data frame of k.values and wss.values to plot
wss.df <- data.frame(k.values, wss.values)
# Elbow graph
elbow2 <- ggplot(data = wss.df, aes(x = k.values, y = wss.values)) +
xlab("Number of Clusters, K") +
ylab("Total Within-Clusters Sum of Squares") +
geom_point() +
geom_line() +
geom_vline(xintercept = 4, linetype = "dashed") +
ggtitle("Optimal number of clusters", subtitle = "Elbow Method") +
theme_classic()
elbow2
The Silhouette method measures how well each observation is clustered by comparing its average distance to points in its own cluster with its average distance to points in the nearest other cluster. Silhouette values fall between -1 and 1: a negative value suggests the observation is in the wrong cluster, a value near 0 means it sits between two clusters, and a value close to 1 means it is well matched to its own cluster.
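A toy computation of the silhouette width s(i) = (b - a) / max(a, b), where a is the mean distance to the observation's own cluster and b is the mean distance to the nearest other cluster, makes the definition concrete (base R, made-up 1-d points):

```r
x  <- c(1, 2, 9, 10)   # four 1-d points
cl <- c(1, 1, 2, 2)    # two clusters
# silhouette width of the first point:
a <- mean(abs(x[1] - x[cl == 1][-1]))  # mean distance within own cluster: 1
b <- mean(abs(x[1] - x[cl == 2]))      # mean distance to other cluster: 8.5
s <- (b - a) / max(a, b)               # ~0.88, i.e. well clustered
```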
sil <- fviz_nbclust(math.stand, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette Method")
grid.arrange(gs.graph, elbow2, sil, nrow = 3)
set.seed(19)
math.k2 <- kmeans(math.num,
centers = 2,
nstart = 25)
math.k2
## K-means clustering with 2 clusters of sizes 64, 331
##
## Cluster means:
## school sex age address famsize Pstatus Medu
## 1 0.0625000 0.4218750 17.31250 0.1875000 0.6406250 0.18750000 2.906250
## 2 0.1268882 0.4833837 16.57704 0.2296073 0.7250755 0.08761329 2.719033
## Fedu Mjob Fjob reason guardian traveltime studytime
## 1 2.562500 2.546875 3.031250 1.046875 0.5156250 1.375000 1.828125
## 2 2.513595 2.471299 2.942598 1.317221 0.3655589 1.462236 2.075529
## failures schoolsup famsup paid activities nursery higher
## 1 0.5625000 0.1250000 0.6718750 0.4375000 0.4375000 0.8125000 0.9375000
## 2 0.2900302 0.1299094 0.6012085 0.4622356 0.5226586 0.7915408 0.9516616
## internet romantic famrel freetime goout Dalc Walc
## 1 0.8906250 0.4531250 3.796875 3.265625 3.312500 1.765625 2.796875
## 2 0.8217523 0.3111782 3.972810 3.229607 3.069486 1.425982 2.193353
## health absences G1 G2 G3
## 1 3.468750 19.234375 10.4375 10.06250 10.06250
## 2 3.570997 3.093656 11.0000 10.83988 10.48338
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2
## [36] 2 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
## [71] 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2
## [211] 2 1 2 1 1 2 1 1 2 2 2 2 2 2 2 1 2 2 1 2 1 2 1 2 1 2 2 1 2 2 1 2 2 2 2
## [246] 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 1 1 2
## [281] 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2 1 2 1 2 1 2 1 1
## [316] 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [351] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 1
## [386] 2 2 2 2 2 1 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 11288.34 26085.58
## (between_SS / total_SS = 27.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
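The "(between_SS / total_SS = 27.4 %)" line in the output above is simply betweenss / totss from the fitted object, equivalently 1 - tot.withinss / totss. A quick sketch on made-up data (base R only) confirms the identity:

```r
set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)           # 20 made-up 2-d points
km  <- kmeans(toy, centers = 2, nstart = 10)
# the ratio kmeans prints is the between-cluster share of total variance
ratio <- km$betweenss / km$totss
all.equal(ratio, 1 - km$tot.withinss / km$totss)  # TRUE
```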
math.k.plot <- math %>%
mutate(cluster2 = as.factor(math.k2$cluster)) %>%
mutate(id = 1:nrow(math))
p2 <- fviz_cluster(math.k2,
data = math.num)
p2
Rules involving higher education appear many times: students at age 15 want to pursue higher education; students whose mothers have a higher education level tend to pursue higher education as well; and female students want to pursue higher education.
ggplot(data = math.k.plot, aes(x = Medu, y = G3)) +
geom_jitter(aes(color = cluster2)) +
facet_grid(~Mjob) +
xlab("Mother's Education Level") +
ylab("Final Grade") +
ggtitle("Mother's Education vs Final Grade", subtitle = "Mother's Job") +
geom_hline(yintercept = 10) +
annotate(geom="text", label = "Pass", x = 1, y = 9.5, vjust=-1.5) +
annotate(geom="text", label = "Fail", x = 1, y = 8.5, vjust = 1) +
theme_light()
Mjob: 0 = ‘Teacher’, 1 = ‘Health’ care related, 2 = Civil ‘services’, 3 = ‘at_home’, 4 = ‘other’
The majority of children whose mothers are teachers or in the health care field score well on their final grade.
set.seed(19)
math.k3 <- kmeans(math.num,
centers = 3,
nstart = 25)
math.k3
## K-means clustering with 3 clusters of sizes 156, 57, 182
##
## Cluster means:
## school sex age address famsize Pstatus Medu
## 1 0.13461538 0.4038462 16.73718 0.2628205 0.7564103 0.07692308 2.461538
## 2 0.05263158 0.4210526 17.28070 0.1578947 0.6666667 0.15789474 2.964912
## 3 0.12087912 0.5494505 16.47802 0.2087912 0.6868132 0.10989011 2.928571
## Fedu Mjob Fjob reason guardian traveltime studytime
## 1 2.288462 2.653846 3.115385 1.346154 0.4230769 1.551282 2.000000
## 2 2.614035 2.508772 3.017544 1.000000 0.4561404 1.368421 1.754386
## 3 2.692308 2.329670 2.802198 1.296703 0.3406593 1.384615 2.153846
## failures schoolsup famsup paid activities nursery higher
## 1 0.54487179 0.19230769 0.6474359 0.4166667 0.5000000 0.7948718 0.9102564
## 2 0.56140351 0.14035088 0.6842105 0.4210526 0.4561404 0.8245614 0.9298246
## 3 0.08241758 0.07142857 0.5604396 0.5054945 0.5329670 0.7857143 0.9890110
## internet romantic famrel freetime goout Dalc Walc
## 1 0.7884615 0.3397436 4.006410 3.211538 3.269231 1.455128 2.314103
## 2 0.9122807 0.4912281 3.754386 3.192982 3.298246 1.701754 2.754386
## 3 0.8461538 0.2802198 3.950549 3.269231 2.912088 1.434066 2.126374
## health absences G1 G2 G3
## 1 3.557692 3.153846 8.108974 7.602564 6.673077
## 2 3.438596 20.122807 10.315789 9.894737 9.877193
## 3 3.587912 3.384615 13.494505 13.637363 13.791209
##
## Clustering vector:
## [1] 1 1 1 3 1 3 3 1 3 3 1 3 3 1 3 3 3 1 2 1 3 3 3 3 1 2 3 3 3 2 3 3 3 1 3
## [36] 1 3 3 3 3 2 3 3 1 2 1 2 3 3 1 3 3 3 1 3 1 3 3 1 3 3 1 1 1 1 3 3 1 1 3
## [71] 3 1 1 3 2 1 3 3 1 1 3 3 1 3 1 1 1 3 2 2 1 3 1 1 3 1 3 1 3 1 2 3 3 2 3
## [106] 3 1 3 3 3 3 1 3 3 1 3 3 3 2 3 3 3 3 2 1 3 1 1 1 3 1 1 2 2 1 1 1 1 3 3
## [141] 1 1 3 3 1 1 1 3 1 1 1 3 1 1 3 1 3 1 3 3 1 1 1 1 1 2 1 3 1 3 1 3 3 1 1
## [176] 1 3 1 1 1 1 3 3 2 2 2 3 3 1 1 3 1 1 1 3 3 3 1 2 1 3 1 1 2 1 2 1 3 1 1
## [211] 1 2 3 2 2 3 2 2 1 1 1 1 3 3 3 2 3 3 2 3 2 3 2 3 2 1 3 2 3 1 2 3 1 3 1
## [246] 3 3 2 1 3 1 1 1 1 3 1 3 2 3 1 2 1 3 1 1 3 1 3 1 1 2 3 3 3 1 3 2 2 2 1
## [281] 2 2 3 1 1 3 3 3 3 3 3 3 3 3 3 3 1 1 3 3 2 3 3 3 2 3 3 2 3 2 1 2 3 2 2
## [316] 2 1 1 3 3 2 2 3 3 3 3 3 1 1 3 1 3 1 1 1 2 3 1 3 1 3 1 3 1 1 3 3 1 3 3
## [351] 1 3 1 1 3 1 3 3 1 3 3 3 3 3 3 1 3 1 1 3 1 3 3 2 3 1 3 1 3 2 3 1 3 1 2
## [386] 1 1 1 1 1 1 3 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 9008.250 10322.596 8303.434
## (between_SS / total_SS = 46.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
p3 <- fviz_cluster(math.k3,
data = math.num)
p3
math.k.plot <- math.k.plot %>%
mutate(cluster3 = as.factor(math.k3$cluster))
ggplot(data = math.k.plot, aes(x = Medu, y = G3)) +
geom_jitter(aes(color = cluster3)) +
facet_grid(~Mjob) +
xlab("Mother's Education Level") +
ylab("Final Grade") +
ggtitle("Mother's Education vs Final Grade", subtitle = "Mother's Job") +
geom_hline(yintercept = 10) +
annotate(geom="text", label = "Pass", x = 1.3, y = 9.5, vjust=-1.5) +
annotate(geom="text", label = "Fail", x = 1.3, y = 8.5, vjust = 1) +
theme_light()
ggplot(data = math.k.plot, aes(x = schoolsup, y = G3)) +
geom_jitter(aes(color = cluster3)) +
facet_grid(~failures) +
xlab("Student") +
ylab("Final Grade") +
geom_hline(yintercept = 10) +
annotate(geom="text", label = "Pass", x = 1.3, y = 9.5, vjust = -1.5) +
annotate(geom="text", label = "Fail", x = 1.3, y = 8.5, vjust = 0) +
ggtitle("School Support vs Final Score", subtitle = "Number of Failures") +
theme_light()
ggplot(data = math.k.plot, aes(x = Walc, y = G3)) +
geom_jitter(aes(color = cluster3)) +
facet_grid(~Mjob) +
xlab("Weekend Alcohol Consumption") +
ylab("Final Grade") +
ggtitle("Weekend Alcohol Consumption vs Final Grade", subtitle = "Mother's Job") +
geom_hline(yintercept = 10) +
annotate(geom="text", label = "Pass", x = 1.3, y = 9.5, vjust = -1.5) +
annotate(geom="text", label = "Fail", x = 1.3, y = 8.5, vjust = 0) +
theme_light()
The students appear to be clustered by their performance on G3: students who scored above 11 fall in Cluster 3, scores of roughly 4-18 in Cluster 2, and scores of 0-12 in Cluster 1.
ggplot(data = math.k.plot, aes(x = famrel, y = G3)) +
geom_jitter(aes(color = cluster3)) +
facet_grid(~freetime) +
xlab("Family Relationship") +
ylab("Final Grade") +
ggtitle("Family Relationship vs Final Grade", subtitle = "Amount of Freetime") +
geom_hline(yintercept = 10) +
annotate(geom="text", label = "Pass", x = 1.3, y = 9.5, vjust = -1.5) +
annotate(geom="text", label = "Fail", x = 1.2, y = 8.5, vjust = 0) +
theme_light()
set.seed(19)
math.k4 <- kmeans(math.num,
centers = 4,
nstart = 25)
math.k4
## K-means clustering with 4 clusters of sizes 5, 138, 173, 79
##
## Cluster means:
## school sex age address famsize Pstatus Medu
## 1 0.00000000 0.2000000 17.80000 0.4000000 0.8000000 0.20000000 3.000000
## 2 0.14492754 0.3768116 16.73188 0.2536232 0.7608696 0.07971014 2.391304
## 3 0.11560694 0.5664740 16.42775 0.2023121 0.6763006 0.09826590 2.913295
## 4 0.07594937 0.4556962 17.15190 0.2025316 0.6962025 0.15189873 3.000000
## Fedu Mjob Fjob reason guardian traveltime studytime
## 1 3.000000 3.200000 2.800000 0.600000 0.8000000 1.400000 2.000000
## 2 2.253623 2.746377 3.101449 1.326087 0.3913043 1.536232 1.985507
## 3 2.722543 2.335260 2.791908 1.283237 0.3352601 1.381503 2.144509
## 4 2.518987 2.303797 3.075949 1.202532 0.4810127 1.443038 1.886076
## failures schoolsup famsup paid activities nursery higher
## 1 0.40000000 0.20000000 0.6000000 0.4000000 0.4000000 0.8000000 0.8000000
## 2 0.56521739 0.21014493 0.6304348 0.3768116 0.4782609 0.7826087 0.8985507
## 3 0.06936416 0.06936416 0.5838150 0.5144509 0.5375723 0.7803468 0.9884393
## 4 0.50632911 0.11392405 0.6455696 0.4810127 0.5063291 0.8481013 0.9620253
## internet romantic famrel freetime goout Dalc Walc
## 1 1.0000000 0.8000000 4.200000 2.200000 2.600000 1.400000 2.000000
## 2 0.7681159 0.3405797 4.000000 3.188406 3.246377 1.376812 2.217391
## 3 0.8497110 0.2658960 3.994220 3.271676 2.919075 1.410405 2.127168
## 4 0.8987342 0.4430380 3.721519 3.303797 3.316456 1.822785 2.797468
## health absences G1 G2 G3
## 1 3.600000 52.600000 10.200000 10.000000 9.40000
## 2 3.572464 2.391304 7.985507 7.442029 6.34058
## 3 3.589595 2.965318 13.560694 13.728324 13.87861
## 4 3.443038 14.544304 10.253165 9.873418 10.01266
##
## Clustering vector:
## [1] 2 2 4 3 2 3 3 2 3 3 2 3 3 3 3 3 3 2 4 2 3 3 3 3 2 4 3 3 3 4 3 3 3 2 3
## [36] 2 3 3 3 3 4 3 3 2 4 2 4 3 3 2 3 3 3 2 3 2 3 3 2 3 3 2 2 2 2 3 3 2 2 3
## [71] 3 2 2 3 1 2 4 3 2 4 3 3 2 3 2 2 2 3 4 4 2 3 2 2 3 2 3 2 3 2 4 3 3 4 3
## [106] 4 2 3 3 3 3 2 3 3 2 3 3 3 4 3 3 3 3 4 2 3 2 2 2 3 2 2 4 4 2 2 2 2 3 3
## [141] 2 2 3 3 2 2 2 3 2 2 2 3 4 2 3 2 3 2 3 3 2 2 2 2 2 4 2 3 2 3 2 3 3 2 2
## [176] 2 3 2 4 2 4 3 3 1 4 4 3 3 2 2 4 2 4 4 3 3 3 4 4 2 3 2 2 4 3 4 2 4 2 2
## [211] 4 4 3 4 4 3 4 4 2 2 2 2 3 3 3 4 3 3 4 4 4 3 4 3 4 4 3 4 3 2 4 3 2 3 2
## [246] 3 3 4 2 3 2 2 2 2 3 2 3 4 3 2 4 2 3 2 2 3 2 4 4 2 4 3 3 3 2 3 1 4 4 4
## [281] 4 4 3 2 2 3 3 3 3 3 4 3 4 3 3 3 2 4 3 3 4 3 3 3 4 3 3 1 3 4 2 4 3 4 4
## [316] 1 2 4 3 3 4 4 3 3 3 3 3 4 2 3 2 3 2 2 2 4 4 2 3 2 3 2 3 2 2 3 3 2 3 4
## [351] 2 3 2 2 3 2 3 3 2 3 3 3 3 3 3 2 3 2 2 4 2 3 3 4 3 2 3 2 3 4 3 2 3 2 4
## [386] 2 2 2 2 2 4 3 2 3 2
##
## Within cluster sum of squares by cluster:
## [1] 999.600 7538.819 7441.156 5379.595
## (between_SS / total_SS = 58.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
p4 <- fviz_cluster(math.k4,
data = math.num)
p4
math.k.plot <- math.k.plot %>%
mutate(cluster4 = as.factor(math.k4$cluster))
ggplot(data = math.k.plot, aes(x = Medu, y = G3)) +
geom_jitter(aes(color = cluster4)) +
facet_grid(~Mjob) +
xlab("Mother's Education Level") +
ylab("Final Grade") +
ggtitle("Mother's Education vs Final Grade", subtitle = "Mother's Job") +
geom_hline(yintercept = 10) +
annotate(geom="text", label = "Pass", x = 1, y = 9.5, vjust=-1.5) +
annotate(geom="text", label = "Fail", x = 1, y = 8.5, vjust = 1)
ggplot(data = math.k.plot, aes(x = schoolsup, y = G3)) +
geom_jitter(aes(color = cluster4)) +
facet_grid(~failures) +
xlab("Student") +
ylab("Final Grade") +
geom_hline(yintercept = 10) +
annotate(geom="text", label = "Pass", x = 1, y = 9.5, vjust = -1.5) +
annotate(geom="text", label = "Fail", x = 1, y = 8.5, vjust = 0)
If a student fails a class 2 or more times, they are likely to fail it again.
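This claim can be checked directly by tabulating the pass rate (G3 >= 10) within each level of failures. The sketch below uses a tiny made-up data frame so it runs standalone; with the real data, swap `toy` for `math`.

```r
toy <- data.frame(failures = c(0, 0, 1, 2, 2, 3),
                  G3       = c(14, 9, 11, 8, 5, 4))
# mean of the logical G3 >= 10 within each failures group = pass rate
pass.rate <- tapply(toy$G3 >= 10, toy$failures, mean)
pass.rate  # named vector of pass rates per failures level
```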
set.seed(19)
math.k5 <- kmeans(math.num,
centers = 5,
nstart = 25)
math.k5
## K-means clustering with 5 clusters of sizes 148, 5, 127, 72, 43
##
## Cluster means:
## school sex age address famsize Pstatus Medu
## 1 0.15540541 0.4324324 16.56757 0.2770270 0.7229730 0.09459459 2.547297
## 2 0.00000000 0.2000000 17.80000 0.4000000 0.8000000 0.20000000 3.000000
## 3 0.10236220 0.5826772 16.40945 0.1653543 0.6771654 0.09448819 3.015748
## 4 0.08333333 0.4305556 17.18056 0.1527778 0.6805556 0.16666667 2.958333
## 5 0.09302326 0.3953488 17.04651 0.3023256 0.8139535 0.04651163 2.279070
## Fedu Mjob Fjob reason guardian traveltime studytime
## 1 2.405405 2.608108 3.067568 1.351351 0.3716216 1.554054 2.040541
## 2 3.000000 3.200000 2.800000 0.600000 0.8000000 1.400000 2.000000
## 3 2.748031 2.236220 2.740157 1.267717 0.3149606 1.338583 2.141732
## 4 2.500000 2.416667 3.069444 1.180556 0.4861111 1.361111 1.875000
## 5 2.232558 2.813953 3.046512 1.255814 0.4651163 1.558140 1.976744
## failures schoolsup famsup paid activities nursery higher
## 1 0.31756757 0.22972973 0.6216216 0.5135135 0.4797297 0.7837838 0.9459459
## 2 0.40000000 0.20000000 0.6000000 0.4000000 0.4000000 0.8000000 0.8000000
## 3 0.05511811 0.03937008 0.5748031 0.4803150 0.5669291 0.8188976 0.9842520
## 4 0.52777778 0.12500000 0.6527778 0.4722222 0.4583333 0.8333333 0.9583333
## 5 0.88372093 0.04651163 0.6279070 0.1860465 0.5348837 0.6976744 0.8604651
## internet romantic famrel freetime goout Dalc Walc
## 1 0.7770270 0.2702703 3.993243 3.195946 3.114865 1.500000 2.358108
## 2 1.0000000 0.8000000 4.200000 2.200000 2.600000 1.400000 2.000000
## 3 0.8661417 0.2598425 4.031496 3.307087 2.937008 1.354331 2.047244
## 4 0.8888889 0.4583333 3.722222 3.347222 3.333333 1.763889 2.763889
## 5 0.8139535 0.5116279 3.860465 3.093023 3.279070 1.325581 2.023256
## health absences G1 G2 G3
## 1 3.594595 3.3175676 9.236486 9.594595 9.7162162
## 2 3.600000 52.6000000 10.200000 10.000000 9.4000000
## 3 3.496063 3.0472441 14.488189 14.582677 14.8346457
## 4 3.486111 15.1944444 10.236111 9.805556 9.9583333
## 5 3.697674 0.4651163 7.302326 4.744186 0.6511628
##
## Clustering vector:
## [1] 1 5 4 3 1 3 1 1 3 3 1 1 3 1 3 3 3 1 4 1 3 3 3 3 1 4 1 3 1 4 1 3 3 1 3
## [36] 1 3 3 1 3 4 1 3 1 4 1 4 3 3 1 3 3 1 1 1 1 3 3 1 3 1 1 1 1 1 3 3 1 1 3
## [71] 3 1 5 3 2 1 1 1 1 4 1 1 4 3 1 1 1 3 4 4 1 3 1 1 3 1 3 1 3 1 4 3 3 4 3
## [106] 4 1 3 1 3 3 1 1 3 1 3 3 3 4 3 3 3 3 4 1 3 1 1 5 3 5 5 4 4 5 5 5 5 3 3
## [141] 5 1 1 3 5 1 5 1 5 1 5 3 1 5 1 1 3 1 3 1 5 1 5 1 1 4 1 3 5 3 5 3 1 5 1
## [176] 1 3 5 4 1 4 3 3 2 4 4 1 3 1 1 4 1 4 1 3 3 3 1 4 1 3 1 1 4 1 4 1 4 1 1
## [211] 4 4 3 4 4 3 4 4 1 1 5 5 3 3 3 4 3 1 4 4 4 1 4 3 4 4 3 4 1 5 4 1 5 3 5
## [246] 3 3 4 5 3 1 1 1 1 1 1 3 4 3 5 4 1 3 1 5 3 1 1 4 5 4 3 1 3 1 1 2 4 4 1
## [281] 4 4 1 1 1 1 3 3 3 3 4 3 4 3 3 3 5 4 3 3 4 1 3 3 4 3 3 2 3 4 5 4 1 4 4
## [316] 2 5 1 1 1 4 4 1 3 3 1 3 1 1 3 1 3 5 5 5 4 4 5 3 1 1 5 3 5 1 3 3 1 3 4
## [351] 1 3 1 1 1 1 3 1 1 3 3 3 1 3 1 1 3 5 1 4 1 3 1 4 3 1 3 1 3 4 3 1 1 5 4
## [386] 1 1 5 1 5 4 3 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 5274.662 999.600 4976.913 4895.889 1716.233
## (between_SS / total_SS = 65.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
p5 <- fviz_cluster(math.k5,
data = math.num)
p5
math.k.plot <- math.k.plot %>%
mutate(cluster5 = as.factor(math.k5$cluster))
ggplot(data = math.k.plot, aes(x = Medu, y = G3)) +
geom_jitter(aes(color = cluster5)) +
facet_grid(~Mjob) +
xlab("Mother's Education Level") +
ylab("Final Grade") +
ggtitle("Mother's Education vs Final Grade", subtitle = "Mother's Job") +
geom_hline(yintercept = 10) +
annotate(geom="text", label = "Pass", x = 1, y = 9.5, vjust = -1.5) +
annotate(geom="text", label = "Fail", x = 1, y = 8.5, vjust = 1) +
theme_light()
Cluster 1 has students that scored 11 or higher; this is the majority of students who earned a passing score on G3. Cluster 3 has the students that scored 0 on G3, suggesting that they did not take the final exam. Cluster 2 has the largest variance of scores, from 5 to 18. Cluster 4 has the middling students (scores between 6 and 13) who would likely pass the exam with extra help. Cluster 5 has fewer than 10 students.
The majority of students in Cluster 1 passed the G3 exam. What do they have in common?
clust.1.math.k5 <- math.k.plot %>%
filter(cluster5 == "1")
ggplot(data = math[clust.1.math.k5$id, ], aes(x = Medu, y = G3)) +
geom_jitter(aes(color = Mjob)) +
facet_grid(~studytime) +
ggtitle("Cluster 1: Mother's Job vs Mother's Education", subtitle = "Student's study time level") +
labs(x = "Education Level",
y = "G3 Grade") +
geom_hline(yintercept = 9) +
annotate(geom="text", label = "Pass", x = 0.75, y = 9.5, vjust = -0.5) +
annotate(geom="text", label = "Fail", x = 0.75, y = 8.5, vjust = 1) +
theme_light()
It looks like the students with final scores between 6 and 11 should put in more study time. There are more see-through dots than solid ones.
clust.2.math.k5 <- math.k.plot %>%
filter(cluster5 == "2")
ggplot(data = math[clust.2.math.k5$id, ], aes(x = Medu, y = G3)) +
geom_jitter(aes(color = Mjob)) +
facet_grid(~studytime) +
ggtitle("Cluster 2: Mother's Job vs Mother's Education", subtitle = "Student's study time level") +
labs(x = "Education Level",
y = "G3 Grade") +
geom_hline(yintercept = 9) +
annotate(geom="text", label = "Pass", x = 2.2, y = 9.5, vjust = -0.5) +
annotate(geom="text", label = "Fail", x = 2.2, y = 8.5, vjust = 1) +
theme_light()
clust.3.math.k5 <- math.k.plot %>%
filter(cluster5 == "3")
ggplot(data = math[clust.3.math.k5$id, ], aes(x = Medu, y = G3)) +
geom_jitter(aes(color = Mjob)) +
facet_grid(~studytime) +
ggtitle("Cluster 3: Mother's Job vs Mother's Education", subtitle = "Student's study time level") +
labs(x = "Education Level",
y = "G3 Grade") +
geom_hline(yintercept = 9) +
annotate(geom="text", label = "Pass", x = 0.75, y = 9.5, vjust = -0.5) +
annotate(geom="text", label = "Fail", x = 0.75, y = 8.5, vjust = 1) +
theme_light()
clust.4.math.k5 <- math.k.plot %>%
filter(cluster5 == "4")
ggplot(data = math[clust.4.math.k5$id, ], aes(x = Medu, y = G3)) +
geom_jitter(aes(color = Mjob)) +
facet_grid(~studytime) +
ggtitle("Cluster 4: Mother's Job vs Mother's Education", subtitle = "Student's study time level") +
labs(x = "Education Level",
y = "G3 Grade") +
geom_hline(yintercept = 9) +
annotate(geom="text", label = "Pass", x = 1.3, y = 9.5, vjust = -0.5) +
annotate(geom="text", label = "Fail", x = 1.3, y = 8.5, vjust = 1) +
theme_light()
clust.5.math.k5 <- math.k.plot %>%
filter(cluster5 == "5")
ggplot(data = math[clust.5.math.k5$id, ], aes(x = Medu, y = G3)) +
geom_jitter(aes(color = Mjob)) +
facet_grid(~studytime) +
ggtitle("Cluster 5: Mother's Job vs Mother's Education", subtitle = "Student's study time level") +
labs(x = "Education Level",
y = "G3 Grade") +
geom_hline(yintercept = 9) +
annotate(geom="text", label = "Pass", x = 1.1, y = 9.5, vjust = -0.2) +
annotate(geom="text", label = "Fail", x = 1.1, y = 8.5, vjust = 1) +
theme_light()
math.nb <- math.num %>%
mutate(G1_bin = ifelse(
G1 < 10, "F", ifelse(
G1 < 12, "D", ifelse(
G1 < 14, "C", ifelse(
G1 < 16, "B", "A" ) ) ) ) ) %>%
mutate(G2_bin = ifelse(
G2 < 10, "F", ifelse(
G2 < 12, "D", ifelse(
G2 < 14, "C", ifelse(
G2 < 16, "B", "A" ) ) ) ) ) %>%
mutate(G3_bin = ifelse(
G3 < 10, "F", ifelse(
G3 < 12, "D", ifelse(
G3 < 14, "C", ifelse(
G3 < 16, "B", "A" ) ) ) ) )
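The nested ifelse() ladder above can also be expressed with base R's cut(), which keeps all the bin edges in one place; a minimal equivalent sketch (grade_bin is a hypothetical helper name):

```r
# right = FALSE makes each interval closed on the left, so g < 10 -> "F",
# 10 <= g < 12 -> "D", and so on, matching the ifelse() chain above.
grade_bin <- function(g) {
  as.character(cut(g,
                   breaks = c(-Inf, 10, 12, 14, 16, Inf),
                   labels = c("F", "D", "C", "B", "A"),
                   right  = FALSE))
}
grade_bin(c(9, 10, 13, 15, 18))  # "F" "D" "C" "B" "A"
```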
main.var <- c("G3_bin", "G2_bin", "G1_bin", "absences", "goout", "Medu", "Walc" ,"reason","freetime", "famrel", "health", "age", "Mjob", "Fjob", "Fedu", "studytime")
math.nb <- math.nb %>%
select(all_of(main.var))
preProc <- preProcess(math.nb, method = c("range","nzv"))
math.nb.scale <- predict(preProc, newdata=math.nb)
math.nb.scale <- math.nb.scale %>% mutate_if(is.character, as.factor)
set.seed(19)
inTraining <- createDataPartition(math.nb.scale$G3_bin, p = 0.7, list = FALSE)
math.train <- math.nb.scale[inTraining,]
math.test <- math.nb.scale[-inTraining,]
set.seed(19)
nb.model1 <- naiveBayes(G3_bin ~., data = math.train)
Predict with the default model:
set.seed(19)
nb.model.pred <- predict(nb.model1, newdata = math.test, type = "class")
caret::confusionMatrix(data = nb.model.pred,
reference = math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D F
## A 11 1 0 0 0
## B 1 13 2 0 0
## C 0 4 11 4 0
## D 0 0 5 20 3
## F 0 0 0 6 36
##
## Overall Statistics
##
## Accuracy : 0.7778
## 95% CI : (0.6916, 0.8494)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7085
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: F
## Precision 0.91667 0.8125 0.57895 0.7143 0.8571
## Recall 0.91667 0.7222 0.61111 0.6667 0.9231
## F1 0.91667 0.7647 0.59459 0.6897 0.8889
## Prevalence 0.10256 0.1538 0.15385 0.2564 0.3333
## Detection Rate 0.09402 0.1111 0.09402 0.1709 0.3077
## Detection Prevalence 0.10256 0.1368 0.16239 0.2393 0.3590
## Balanced Accuracy 0.95357 0.8460 0.76515 0.7874 0.9231
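The headline accuracy can be verified from the confusion matrix alone: it is the sum of the diagonal divided by the number of test observations. A quick hand check using the counts printed above:

```r
cm <- matrix(c(11,  1,  0,  0,  0,
                1, 13,  2,  0,  0,
                0,  4, 11,  4,  0,
                0,  0,  5, 20,  3,
                0,  0,  0,  6, 36),
             nrow = 5, byrow = TRUE,
             dimnames = list(pred = c("A","B","C","D","F"),
                             ref  = c("A","B","C","D","F")))
accuracy <- sum(diag(cm)) / sum(cm)      # 91 / 117 = 0.7778
# per-class precision for class C: correct C predictions / all C predictions
prec.C <- cm["C", "C"] / sum(cm["C", ])  # 11 / 19 = 0.57895
```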
set.seed(19)
nb.train.control <- trainControl(method = "cv", number = 3)
nb <- train(G3_bin~.,
math.train,
method = "naive_bayes",
trControl = nb.train.control)
plot(nb)
nb.grid <- expand.grid(laplace = seq(1, 3, 1),
usekernel = TRUE,
adjust = seq(1, 3, 1))
nb.tune <- train(G3_bin~.,
math.train,
method = "naive_bayes",
tuneGrid = nb.grid,
trControl = nb.train.control)
plot(nb.tune)
nb.tune.pred <- predict(nb.tune, newdata=math.test)
confusionMatrix(nb.tune.pred, math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D F
## A 0 0 0 0 0
## B 1 5 0 0 0
## C 1 4 12 4 1
## D 10 9 6 22 9
## F 0 0 0 4 29
##
## Overall Statistics
##
## Accuracy : 0.5812
## 95% CI : (0.4864, 0.6718)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 3.418e-08
##
## Kappa : 0.4389
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: F
## Precision NA 0.83333 0.5455 0.3929 0.8788
## Recall 0.0000 0.27778 0.6667 0.7333 0.7436
## F1 NA 0.41667 0.6000 0.5116 0.8056
## Prevalence 0.1026 0.15385 0.1538 0.2564 0.3333
## Detection Rate 0.0000 0.04274 0.1026 0.1880 0.2479
## Detection Prevalence 0.0000 0.05128 0.1880 0.4786 0.2821
## Balanced Accuracy 0.5000 0.63384 0.7828 0.6713 0.8462
SVM can handle both numeric and nominal variables.
#### set up Pass/Fail classes
math.svm <- math.num %>%
  mutate(G1_bin = ifelse(G1 < 10, "Fail", "Pass")) %>%
  mutate(G2_bin = ifelse(G2 < 10, "Fail", "Pass")) %>%
  mutate(G3_bin = ifelse(G3 < 10, "Fail", "Pass"))
math.svm <- math.svm %>%
  select(-c(G1, G2, G3))
preProc <- preProcess(math.svm, method = c("scale","nzv"))
math.svm.scale <- predict(preProc, newdata=math.svm)
math.svm.scale <- math.svm.scale %>% mutate_if(is.character, as.factor)
set.seed(19)
inTraining <- createDataPartition(math.svm.scale$G3_bin, p = 0.7, list = FALSE)
math.train <- math.svm.scale[inTraining,]
math.test <- math.svm.scale[-inTraining,]
set.seed(19)
svm.model1 <- svm(G3_bin~., data = math.train, type = "C-classification", kernel = "linear")
svm.pred <- predict(svm.model1, newdata=math.test)
confusionMatrix(svm.pred, math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 38 4
## Pass 1 75
##
## Accuracy : 0.9576
## 95% CI : (0.9039, 0.9861)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : 1.54e-14
##
## Kappa : 0.9061
##
## Mcnemar's Test P-Value : 0.3711
##
## Precision : 0.9048
## Recall : 0.9744
## F1 : 0.9383
## Prevalence : 0.3305
## Detection Rate : 0.3220
## Detection Prevalence : 0.3559
## Balanced Accuracy : 0.9619
##
## 'Positive' Class : Fail
##
set.seed(19)
svm.model2 <- svm(G3_bin~., data = math.train, type = "C-classification", kernel = "polynomial")
svm.pred <- predict(svm.model2, newdata=math.test)
confusionMatrix(svm.pred, as.factor(math.test$G3_bin), mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 6 2
## Pass 33 77
##
## Accuracy : 0.7034
## 95% CI : (0.6123, 0.7839)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : 0.2486
##
## Kappa : 0.1609
##
## Mcnemar's Test P-Value : 3.959e-07
##
## Precision : 0.75000
## Recall : 0.15385
## F1 : 0.25532
## Prevalence : 0.33051
## Detection Rate : 0.05085
## Detection Prevalence : 0.06780
## Balanced Accuracy : 0.56426
##
## 'Positive' Class : Fail
##
set.seed(19)
svm.model3 <- svm(G3_bin~., data = math.train, type = "C-classification", kernel = "radial")
svm.pred <- predict(svm.model3, newdata=math.test)
confusionMatrix(svm.pred, as.factor(math.test$G3_bin), mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 28 3
## Pass 11 76
##
## Accuracy : 0.8814
## 95% CI : (0.809, 0.9336)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : 9.905e-08
##
## Kappa : 0.7172
##
## Mcnemar's Test P-Value : 0.06137
##
## Precision : 0.9032
## Recall : 0.7179
## F1 : 0.8000
## Prevalence : 0.3305
## Detection Rate : 0.2373
## Detection Prevalence : 0.2627
## Balanced Accuracy : 0.8400
##
## 'Positive' Class : Fail
##
# Set up 3-fold cross validation procedure
train_control <- trainControl( method = "cv", number = 3)
# Tune the model - find the optimal C
set.seed(19)
svmGrid <- expand.grid(C = seq(1,10,by=1))
svm.caret <- train(G3_bin~., data = math.train, method="svmLinear", tuneGrid = svmGrid,
trControl = train_control)
# Visualize the tuning result
plot(svm.caret)
# Validation
set.seed(19)
svm.pred<- predict(svm.caret, newdata = math.test, type="raw")
# Results
confusionMatrix(svm.pred,math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 37 5
## Pass 2 74
##
## Accuracy : 0.9407
## 95% CI : (0.8816, 0.9758)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : 1.179e-12
##
## Kappa : 0.8685
##
## Mcnemar's Test P-Value : 0.4497
##
## Precision : 0.8810
## Recall : 0.9487
## F1 : 0.9136
## Prevalence : 0.3305
## Detection Rate : 0.3136
## Detection Prevalence : 0.3559
## Balanced Accuracy : 0.9427
##
## 'Positive' Class : Fail
##
# Set up 3-fold cross validation procedure
train_control <- trainControl( method = "cv", number = 3)
# Tune the model
set.seed(19)
svmGrid <- expand.grid(degree=seq(2,5,by=1),scale= c(0.01,0.001) , C = seq(1,10,by=1))
svm.caret <- train(G3_bin~., data = math.train, method="svmPoly", tuneGrid = svmGrid,
trControl = train_control)
# Visualize the tuning result
plot(svm.caret)
# Validation
set.seed(19)
svm.pred<- predict(svm.caret, newdata = math.test, type="raw")
# Results
confusionMatrix(svm.pred,math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 38 5
## Pass 1 74
##
## Accuracy : 0.9492
## 95% CI : (0.8926, 0.9811)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : 1.461e-13
##
## Kappa : 0.888
##
## Mcnemar's Test P-Value : 0.2207
##
## Precision : 0.8837
## Recall : 0.9744
## F1 : 0.9268
## Prevalence : 0.3305
## Detection Rate : 0.3220
## Detection Prevalence : 0.3644
## Balanced Accuracy : 0.9555
##
## 'Positive' Class : Fail
##
# Set up 3-fold cross validation procedure
train_control <- trainControl( method = "cv", number = 3)
# Tune the model
set.seed(19)
svmGrid <- expand.grid(sigma=c(0.005, 0.01,0.001) , C = seq(1,10,by=1))
svm.caret <- train(G3_bin~., data = math.train, method="svmRadial", tuneGrid = svmGrid,
trControl = train_control)
# Visualize the tuning result
plot(svm.caret)
# Validation
set.seed(19)
svm.pred<- predict(svm.caret, newdata = math.test, type="raw")
# Results
confusionMatrix(svm.pred,math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 38 5
## Pass 1 74
##
## Accuracy : 0.9492
## 95% CI : (0.8926, 0.9811)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : 1.461e-13
##
## Kappa : 0.888
##
## Mcnemar's Test P-Value : 0.2207
##
## Precision : 0.8837
## Recall : 0.9744
## F1 : 0.9268
## Prevalence : 0.3305
## Detection Rate : 0.3220
## Detection Prevalence : 0.3644
## Balanced Accuracy : 0.9555
##
## 'Positive' Class : Fail
##
Random Forest can handle both numeric and nominal variables.
math.rf <- math.num %>%
  mutate(G1_bin = ifelse(G1 < 10, "F",
                  ifelse(G1 < 12, "D",
                  ifelse(G1 < 14, "C",
                  ifelse(G1 < 16, "B", "A"))))) %>%
  mutate(G2_bin = ifelse(G2 < 10, "F",
                  ifelse(G2 < 12, "D",
                  ifelse(G2 < 14, "C",
                  ifelse(G2 < 16, "B", "A"))))) %>%
  mutate(G3_bin = ifelse(G3 < 10, "F",
                  ifelse(G3 < 12, "D",
                  ifelse(G3 < 14, "C",
                  ifelse(G3 < 16, "B", "A")))))
math.rf <- math.rf %>%
  select(-G1, -G2, -G3)
preProc <- preProcess(math.rf, method = c("scale","nzv"))
math.rf.scale <- predict(preProc, newdata=math.rf)
math.rf.scale <- math.rf.scale %>% mutate_if(is.character, as.factor)
set.seed(19)
inTraining <- createDataPartition(math.rf.scale$G3_bin, p = 0.7, list = FALSE)
math.train <- math.rf.scale[inTraining,]
math.test <- math.rf.scale[-inTraining,]
set.seed(19)
rf.model <- randomForest(G3_bin~., data = math.train, ntree = 500)
rf.model.pred <- predict(rf.model, newdata=math.test)
confusionMatrix(rf.model.pred, math.test$G3_bin)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D F
## A 11 1 0 0 0
## B 1 12 2 0 0
## C 0 3 9 4 0
## D 0 1 7 20 3
## F 0 1 0 6 36
##
## Overall Statistics
##
## Accuracy : 0.7521
## 95% CI : (0.6638, 0.8273)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6731
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: F
## Sensitivity 0.91667 0.6667 0.50000 0.6667 0.9231
## Specificity 0.99048 0.9697 0.92929 0.8736 0.9103
## Pos Pred Value 0.91667 0.8000 0.56250 0.6452 0.8372
## Neg Pred Value 0.99048 0.9412 0.91089 0.8837 0.9595
## Prevalence 0.10256 0.1538 0.15385 0.2564 0.3333
## Detection Rate 0.09402 0.1026 0.07692 0.1709 0.3077
## Detection Prevalence 0.10256 0.1282 0.13675 0.2650 0.3675
## Balanced Accuracy 0.95357 0.8182 0.71465 0.7701 0.9167
# Feature analysis
varImpPlot(rf.model, n.var = 10)
# Set up 3-fold cross validation procedure
train_control <- trainControl( method = "cv", number = 3)
# Tune the model - find the optimal mtry
set.seed(19)
rfGrid <- expand.grid(mtry = seq(1,10,by=1))
rf.caret <- train(G3_bin~., data = math.train, method="rf", tuneGrid = rfGrid,
trControl = train_control)
# Visualize the tuning result
plot(rf.caret)
# Validation
set.seed(19)
rf.pred<- predict(rf.caret, newdata = math.test, type="raw")
# Results
confusionMatrix(rf.pred, math.test$G3_bin)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D F
## A 12 1 0 0 0
## B 0 13 1 0 0
## C 0 4 12 4 0
## D 0 0 5 20 2
## F 0 0 0 6 37
##
## Overall Statistics
##
## Accuracy : 0.8034
## 95% CI : (0.7198, 0.8711)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7421
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: F
## Sensitivity 1.0000 0.7222 0.6667 0.6667 0.9487
## Specificity 0.9905 0.9899 0.9192 0.9195 0.9231
## Pos Pred Value 0.9231 0.9286 0.6000 0.7407 0.8605
## Neg Pred Value 1.0000 0.9515 0.9381 0.8889 0.9730
## Prevalence 0.1026 0.1538 0.1538 0.2564 0.3333
## Detection Rate 0.1026 0.1111 0.1026 0.1709 0.3162
## Detection Prevalence 0.1111 0.1197 0.1709 0.2308 0.3675
## Balanced Accuracy 0.9952 0.8561 0.7929 0.7931 0.9359
Random Forest can handle both numeric and nominal variables.
math.rf <- math.num %>%
  mutate(G1_bin = ifelse(G1 < 10, "Fail", "Pass")) %>%
  mutate(G2_bin = ifelse(G2 < 10, "Fail", "Pass")) %>%
  mutate(G3_bin = ifelse(G3 < 10, "Fail", "Pass"))
math.rf <- math.rf %>%
  select(-c(G1, G2, G3))
preProc <- preProcess(math.rf, method = c("scale","nzv"))
math.rf.scale <- predict(preProc, newdata=math.rf)
math.rf.scale <- math.rf.scale %>% mutate_if(is.character, as.factor)
set.seed(19)
inTraining <- createDataPartition(math.rf.scale$G3_bin, p = 0.7, list = FALSE)
math.train <- math.rf.scale[inTraining,]
math.test <- math.rf.scale[-inTraining,]
set.seed(19)
rf.model <- randomForest(G3_bin~., data = math.train, ntree = 500)
rf.model.pred <- predict(rf.model, newdata=math.test)
confusionMatrix(rf.model.pred, math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 37 4
## Pass 2 75
##
## Accuracy : 0.9492
## 95% CI : (0.8926, 0.9811)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : 1.461e-13
##
## Kappa : 0.8866
##
## Mcnemar's Test P-Value : 0.6831
##
## Precision : 0.9024
## Recall : 0.9487
## F1 : 0.9250
## Prevalence : 0.3305
## Detection Rate : 0.3136
## Detection Prevalence : 0.3475
## Balanced Accuracy : 0.9490
##
## 'Positive' Class : Fail
##
# Feature analysis
varImpPlot(rf.model, n.var = 15)
# Set up 3-fold cross validation procedure
train_control <- trainControl( method = "cv", number = 3)
# Tune the model - find the optimal mtry
set.seed(19)
rfGrid <- expand.grid(mtry = seq(1,10,by=1))
rf.caret <- train(G3_bin~., data = math.train, method="rf", tuneGrid = rfGrid,
trControl = train_control)
# Visualize the tuning result
plot(rf.caret)
# Validation
set.seed(19)
rf.pred<- predict(rf.caret, newdata = math.test, type="raw")
# Results
confusionMatrix(rf.pred, math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 38 5
## Pass 1 74
##
## Accuracy : 0.9492
## 95% CI : (0.8926, 0.9811)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : 1.461e-13
##
## Kappa : 0.888
##
## Mcnemar's Test P-Value : 0.2207
##
## Precision : 0.8837
## Recall : 0.9744
## F1 : 0.9268
## Prevalence : 0.3305
## Detection Rate : 0.3220
## Detection Prevalence : 0.3644
## Balanced Accuracy : 0.9555
##
## 'Positive' Class : Fail
##
The strongest indicators of a student’s performance at the end of the course are his or her grades in the previous two terms, G1 and G2. Other factors with a large impact are the number of absences, the number of previously failed courses, the mother’s and father’s education levels, and the mother’s job.
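As a sketch, this ranking can be read directly from the fitted forest’s importance scores rather than eyeballed from the plot (this assumes the `rf.model` object from the Pass/Fail random forest run above is still in scope):

```r
library(randomForest)

# Mean-decrease-in-Gini importance, one row per predictor,
# sorted so the strongest predictors of the final grade come first.
imp <- importance(rf.model)
imp <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
head(imp, 10)
```

The top rows should mirror the `varImpPlot` output (G1, G2, absences, failures, parents’ education).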
If a student fails a course and does not receive support from the school, such as tutoring or after-school help, the student is likely to fail again: as the number of past failures increases, so does the number of students below the Pass/Fail line on the plot.
Note that most of the students who failed a course fall into the “no school support” category, so school support is a factor that can have a large impact on each grading period.
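The school-support pattern can be checked with a simple cross-tabulation (a sketch; it assumes the training frame still carries a `schoolsup` column, i.e. that it was not dropped by the `nzv` filter during preprocessing):

```r
# Cross-tabulate school support against the Pass/Fail outcome;
# row proportions give the failure rate within each support group.
tab <- table(math.train$schoolsup, math.train$G3_bin)
tab
prop.table(tab, margin = 1)
```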
We recommend gathering data from more students so that the models can better predict student performance using variables other than previous scores.